arXiv:1502.07073v3 [cs.LG] 19 Jun 2015


Strongly Adaptive Online Learning 

Amit Daniely    Alon Gonen    Shai Shalev-Shwartz

June 22, 2015 


Abstract 

Strongly adaptive algorithms are algorithms whose performance on every
time interval is close to optimal. We present a reduction that can
transform standard low-regret algorithms into strongly adaptive ones. As a
consequence, we derive simple, yet efficient, strongly adaptive algorithms
for a handful of problems.


1 Introduction 

Coping with changing environments and rapidly adapting to changes is a key
component in many tasks. A broker is highly rewarded for rapidly adjusting
to new trends. A reliable routing algorithm must respond quickly to congestion.
A web advertiser should adjust to new ads and to changes in the taste
of its users. A politician can also benefit from quickly adjusting to changes in
public opinion. And the list goes on.

Most current algorithms and theoretical analyses focus on relatively
stationary environments. In statistical learning, an algorithm should perform
well on the training distribution. Even in online learning, an algorithm should
usually compete with the best strategy (from a pool), which is fixed and does not
change over time.

Our main focus is to investigate to what extent such algorithms can be
modified to cope with changing environments.

We consider a general online learning framework that encompasses various
online learning problems including prediction with expert advice, online
classification, online convex optimization and more. In this framework, a
learning scenario is defined by a decision set $\mathcal{D}$, a context space
$\mathcal{C}$ and a set $\mathcal{L}$ of real-valued loss functions defined over
$\mathcal{D}$. The learner sequentially observes a context $c_t \in \mathcal{C}$
and then picks a decision $x_t \in \mathcal{D}$. Next, a loss function
$\ell_t \in \mathcal{L}$ is revealed and the learner suffers a loss $\ell_t(x_t)$.

Often, algorithms in such scenarios are evaluated by comparing their
performance to the performance of the best strategy from a pool of strategies
(usually, this pool is simply all strategies that play the same action all the
time). Concretely, the regret, $R_A(T)$, of an algorithm $A$ is defined as its
cumulative loss minus the cumulative loss of the best strategy in the pool. The
rationale behind this evaluation metric is that one of the strategies in the pool
is reasonably good during the entire course of the game. However, when the
environment is changing, different strategies will be good in different periods.
As we do not want to make any assumption on the duration of each of these
periods, we would like to guarantee that our algorithm performs well on every
interval $I = [q, s] \subseteq [T]$. Clearly, we cannot hope to have a regret
bound which is better than what we have for algorithms that are tested only on
$I$. If this barrier is met, we say that the corresponding algorithm is strongly
adaptive.¹

Perhaps surprisingly, our main result shows that for many learning problems
strongly adaptive algorithms exist. Concretely, we show a simple
“meta-algorithm” that can use any online algorithm (that was possibly designed to
have just small standard regret) as a black box, and produces a new algorithm
that is designed to have a small regret on every interval. We show that if the
original algorithm has a regret bound of $R(T)$, then the produced algorithm
has, on every interval $I = [q, s]$ of size $\tau := |I|$, regret that is very
close to $R(\tau)$ (see a precise statement in Section 1.2). Moreover, the running
time of the new algorithm at round $t$ is just $O(\log(t))$ times larger than that
of the original algorithm. As an immediate corollary we obtain strongly adaptive
algorithms for a handful of online problems including prediction with expert
advice, online convex optimization, and more.

Furthermore, we show that strong adaptivity is stronger than previously
suggested adaptivity properties including the adaptivity notion of [8] and the
tracking notion of [9]. Namely, strongly adaptive algorithms are also adaptive
(in the sense of [8]), and have a near optimal tracking regret (in the sense of
[9]). We conclude our discussion by showing that strong adaptivity cannot be
achieved with bandit feedback.

1.1 Problem setting 

A Framework for Online Learning 

Many learning problems can be described as a repeated game between the
learner and the environment, which we describe below.

A learning scenario is determined by a triplet $(\mathcal{D}, \mathcal{C}, \mathcal{L})$,
where $\mathcal{D}$ is a decision space, $\mathcal{C}$ is a set of contexts, and
$\mathcal{L}$ is a set of loss functions from $\mathcal{D}$ to $[0,1]$.
Extending the results to general bounded losses is straightforward. The number
of rounds, denoted $T$, is unknown to the learner. At each time $t \in [T]$, the
learner sees a context $c_t \in \mathcal{C}$, and then chooses an action
$x_t \in \mathcal{D}$. Simultaneously, the environment chooses a loss function
$\ell_t \in \mathcal{L}$. Then, the action $x_t$ is revealed to the environment,
and the loss function $\ell_t$ is revealed to the learner, which suffers the loss
$\ell_t(x_t)$. We list below some examples of families of learning scenarios.

¹ See a precise definition in Section 1.1. Also, see Section 1.3 for a weaker notion of adaptive
algorithms that was studied in [8].

• Learning with expert advice [4]. Here, there is no context (formally,
$\mathcal{C}$ consists of a single element), $\mathcal{D}$ is a finite set of size
$N$ (each element in this set corresponds to an expert), and $\mathcal{L}$
consists of all functions from $\mathcal{D}$ to $[0,1]$.

• Online convex optimization [13]. Here, there is no context as well,
$\mathcal{D}$ is a convex set, and $\mathcal{L}$ is a collection of convex
functions from $\mathcal{D}$ to $[0,1]$.

• Classification. Here, $\mathcal{C}$ is some set, $\mathcal{D}$ is a finite set,
and $\mathcal{L}$ consists of all functions from $\mathcal{D}$ to $\{0,1\}$ that
are indicators of a single element.

• Regression. Here, $\mathcal{C}$ is a subset of a Euclidean space,
$\mathcal{D} = [0,1]$, and $\mathcal{L}$ consists of all functions of the form
$\ell(\hat{y}) = (\hat{y} - y)^2$ for $y \in [0,1]$.

A learning problem is a quadruple $\mathcal{P} = (\mathcal{D}, \mathcal{C}, \mathcal{L}, \mathcal{W})$,
where $\mathcal{W}$ is a benchmark of strategies that is used to evaluate the
performance of algorithms. Here, each strategy $w \in \mathcal{W}$ makes a
prediction $x_t(w) \in \mathcal{D}$ based on some rule. We assume that the
prediction $x_t(w)$ of each strategy is fully determined by the game's history at
the time of the prediction, i.e., by $(c_1, \ell_1), \ldots, (c_{t-1}, \ell_{t-1}), c_t$.
Usually, $\mathcal{W}$ consists of very simple strategies. For example, in
context-less scenarios (like learning with expert advice and online convex
optimization), $\mathcal{W}$ is often identified with $\mathcal{D}$, and the
strategy corresponding to $x \in \mathcal{D}$ simply predicts $x$ at each step.
In contextual problems (such as classification and regression), $\mathcal{W}$ is
often a collection of functions from $\mathcal{C}$ to $\mathcal{D}$ (a hypothesis
class), and the prediction of the strategy corresponding to
$h : \mathcal{C} \to \mathcal{D}$ at time $t$ is simply $h(c_t)$.

The cumulative loss of $w \in \mathcal{W}$ at time $T$ is
$L_w(T) = \sum_{t=1}^{T} \ell_t(x_t(w))$ and the cumulative loss of an algorithm
$A$ is $L_A(T) = \sum_{t=1}^{T} \ell_t(x_t)$. The cumulative regret of $A$ is
$R_A(T) = L_A(T) - \inf_{w \in \mathcal{W}} L_w(T)$. We define the regret,
$R_{\mathcal{P}}(T)$, of the learning problem $\mathcal{P}$ as the minimax regret
bound. Namely, $R_{\mathcal{P}}(T)$ is the minimal number for which there exists
an algorithm $A$ such that for every environment $R_A(T) \le R_{\mathcal{P}}(T)$.
We say that an algorithm $A$ has low regret if
$R_A(T) = O(\mathrm{poly}(\log T) \, R_{\mathcal{P}}(T))$ for every environment.

We note that both the learner and the environment can make random
decisions. In that case, the quantities defined above refer to the expected
value of the corresponding terms.
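To make the definition of (expected) cumulative regret concrete, the following is a minimal sketch for the expert-advice scenario. It uses a standard exponential-weights (Hedge) learner with learning rate $\sqrt{8 \ln(N) / T}$; this particular variant, the function name `hedge_expected_regret`, and the random loss sequence are assumptions chosen for illustration and are not taken from the paper. The expected loss of the randomized learner is computed exactly as the inner product of its distribution with the loss vector.

```python
import math
import random


def hedge_expected_regret(losses, eta):
    """Expected cumulative regret of exponential weights against the best fixed expert.

    losses[t][i] is the loss in [0, 1] of expert i at round t.
    """
    n = len(losses[0])
    weights = [1.0] * n
    learner_loss = 0.0
    cumulative = [0.0] * n
    for loss in losses:
        total = sum(weights)
        probs = [w / total for w in weights]
        # Expected loss of sampling an expert from `probs`.
        learner_loss += sum(p * l for p, l in zip(probs, loss))
        cumulative = [c + l for c, l in zip(cumulative, loss)]
        # Multiplicative update: experts with higher loss lose weight.
        weights = [w * math.exp(-eta * l) for w, l in zip(weights, loss)]
    return learner_loss - min(cumulative)


# Hypothetical environment: i.i.d. uniform losses for N = 4 experts.
rng = random.Random(0)
T, N = 200, 4
losses = [[rng.random() for _ in range(N)] for _ in range(T)]
regret = hedge_expected_regret(losses, math.sqrt(8 * math.log(N) / T))
```

With this learning rate the classical analysis gives an expected regret of at most $\sqrt{T \ln(N) / 2}$ on any loss sequence in $[0,1]$, which is within the $2\sqrt{\ln(N) T}$ bound quoted for Multiplicative Weights in Section 1.2.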

Strongly Adaptive Regret 

Let $I = [q, s] := \{q, q+1, \ldots, s\} \subseteq [T]$. The loss of
$w \in \mathcal{W}$ during the interval $I$ is
$L_w(I) = \sum_{t=q}^{s} \ell_t(x_t(w))$ and the loss of an algorithm $A$ during
the interval $I$ is $L_A(I) = \sum_{t=q}^{s} \ell_t(x_t)$. The regret of $A$
during the interval $I$ is $R_A(I) = L_A(I) - \inf_{w \in \mathcal{W}} L_w(I)$.
The strongly adaptive regret of $A$ at time $T$ is the function

$$\text{SA-Regret}_T^A(\tau) = \max_{I = [q, q+\tau-1] \subseteq [T]} R_A(I)$$
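The definition above can be evaluated by brute force for a recorded run in the expert-advice scenario. The sketch below is illustrative only: the helper names `interval_regret` and `sa_regret`, and the toy loss sequence, are assumptions and not part of the paper. It shows a learner whose standard regret is zero while its strongly adaptive regret on windows of length 4 is as large as possible.

```python
def interval_regret(losses, actions, q, s):
    """Regret on rounds q..s (1-indexed, inclusive) against the best fixed action."""
    rounds = range(q - 1, s)
    learner = sum(losses[t][actions[t]] for t in rounds)
    best = min(sum(losses[t][a] for t in rounds) for a in range(len(losses[0])))
    return learner - best


def sa_regret(losses, actions, tau):
    """Maximum regret over all intervals of length tau contained in [1, T]."""
    T = len(losses)
    return max(interval_regret(losses, actions, q, q + tau - 1)
               for q in range(1, T - tau + 2))


# Toy environment: expert 0 is perfect for 4 rounds, then expert 1 is perfect.
losses = [[0, 1]] * 4 + [[1, 0]] * 4
actions = [0] * 8            # a learner that never leaves expert 0

whole_game = sa_regret(losses, actions, 8)   # standard regret on [1, 8]
short_window = sa_regret(losses, actions, 4) # worst window of length 4
```

On the whole game both experts accumulate the same loss as the learner, so the standard regret vanishes, whereas on the window [5, 8] the learner pays 4 and expert 1 pays nothing.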

We say that $A$ is strongly adaptive if for every environment,
$\text{SA-Regret}_T^A(\tau) = O(\mathrm{poly}(\log T) \cdot R_{\mathcal{P}}(\tau))$.

1.2 Our Results

A strongly adaptive meta-algorithm

Achieving strongly adaptive regret seems more challenging than ensuring low
regret. Nevertheless, we show that often, low-regret algorithms can be
transformed into strongly adaptive algorithms with little extra computational
cost.

Concretely, fix a learning scenario $(\mathcal{D}, \mathcal{C}, \mathcal{L})$. We
derive a strongly adaptive meta-algorithm that can use any algorithm $B$ (that
presumably has low regret w.r.t. some learning problem) as a black box. We call
our meta-algorithm Strongly Adaptive Online Learner (SAOL). The specific
instantiation of SAOL that uses $B$ as the black box is denoted SAOL$^B$.

Fix a set $\mathcal{W}$ of strategies and an algorithm $B$ whose regret w.r.t.
$\mathcal{W}$ satisfies

$$R_B(T) \le C \cdot T^{\alpha}, \qquad (1)$$

where $\alpha \in (0,1)$, and $C > 0$ is some scalar. The properties of
SAOL$^B$ are summarized in the theorem below. The description of the algorithm
and the proof of Theorem 1 are given in Section 2.

Theorem 1

1. For every interval $I = [q, s] \subseteq \mathbb{N}$,

$$R_{\mathrm{SAOL}^B}(I) \le \frac{4}{2^{\alpha} - 1} C |I|^{\alpha} + 40 \log(s+1) |I|^{1/2} .$$

2. In particular, if $\alpha \ge \frac{1}{2}$ and $B$ has low regret, then
SAOL$^B$ is strongly adaptive.

3. The runtime of SAOL at time $t$ is at most $\log(t+1)$ times the runtime
per iteration of $B$.

From part 1 we can derive strongly adaptive algorithms for many online
problems. Two examples are outlined below.

• Prediction with $N$ experts advice. The Multiplicative Weights (MW)
algorithm has regret $\le 2\sqrt{\ln(N) T}$. Hence, for every
$I = [q, s] \subseteq [T]$,

$$R_{\mathrm{SAOL}^{MW}}(I) = O\left( \left(\sqrt{\log(N)} + \log(s+1)\right) \sqrt{|I|} \right) .$$

• Online convex optimization with $G$-Lipschitz loss functions over
a convex set $\mathcal{D}$ of diameter $B$. Online Gradient Descent (OGD)
has regret $\le 3BG\sqrt{T}$. Hence, for every $I = [q, s] \subseteq [T]$,

$$R_{\mathrm{SAOL}^{OGD}}(I) = O\left( \left(BG + \log(s+1)\right) \sqrt{|I|} \right) .$$

Comparison to (weak) adaptivity and tracking

Several alternative measures for coping with changing environments were
proposed in the literature. The two that are most related to our work are
tracking regret [9] and adaptive regret [8] (other notions are briefly discussed
in Section 1.3).

Adaptivity, as defined in [8], is a weaker requirement than strong adaptivity.
The adaptive regret of a learner $A$ at time $T$ is
$\max_{I \subseteq [T]} R_A(I)$. An algorithm is called adaptive if its adaptive
regret is $O(\mathrm{poly}(\log T) \, R_{\mathcal{P}}(T))$. For online convex
optimization problems for which there exists an algorithm with regret bound
$R(T)$, [8] derived an efficient algorithm whose adaptive regret is at most
$R(T)\log(T) + O\left(\sqrt{T \log^3(T)}\right)$, thus establishing adaptive
algorithms for many online convex optimization problems. For the case where the
loss functions are $\alpha$-exp-concave, they showed an algorithm with adaptive
regret $O(\frac{1}{\alpha} \log^2(T))$ (we note that according to our definition
this algorithm is in fact strongly adaptive).
A main difference between adaptivity and strong adaptivity is that in many
problems, adaptive algorithms are not guaranteed to perform well on small
intervals. For example, for many problems including online convex optimization
and learning with expert advice, the best possible adaptive regret is
$\Omega(\sqrt{T})$. Such a bound is meaningless for intervals of size
$O(\sqrt{T})$. We note that in many scenarios (e.g. routing, paging, news
headlines promotion) it is highly desirable to perform well even on very small
intervals.

The problem of “tracking the best expert” was studied in [9] (see also
[?]). In that problem, originally formulated for the learning with expert
advice problem, learning algorithms are compared to all strategies that shift
from one expert to another a bounded number of times. They derived an
efficient algorithm, named Fixed-Share, which attains a near-optimal regret bound
of $\sqrt{T m (\log(T) + \log(N))}$ versus the best strategy that shifts between
experts at most $m$ times. (Interestingly, a recent work [5] showed that the
Fixed-Share algorithm is in fact (weakly) adaptive.) As we show in Section 3,
strongly adaptive algorithms enjoy near-optimal tracking regret in the experts
problem, and in fact, in many other problems (e.g., online convex optimization).
We note that as with (weakly) adaptive algorithms, algorithms with optimal
tracking regret are not guaranteed to perform well on small intervals.

Strong adaptivity with bandit feedback 

In the so-called bandit setting, the loss function $\ell_t$ is not exposed to the
learner. Rather, the learner just gets to see the loss, $\ell_t(x_t)$, that it
has suffered. In Section 4 we prove that there are no strongly adaptive
algorithms that can cope with bandit feedback. Even in the simple experts problem
we show that for every $\epsilon > 0$, there is no algorithm whose strongly
adaptive regret is $O(|I|^{1-\epsilon} \cdot \mathrm{poly}(\log T))$.
Investigating possible alternative notions and/or weaker guarantees in the bandit
setting is mostly left for future work.

1.3 Related Work


Maybe the most relevant previous work, from which we borrow many of our
techniques, is [2]. They focused on the expert setting and proposed a
strengthened notion of regret using time selection functions, which are functions
from the time interval $[T]$ to $[0,1]$. The regret of a learner $A$ with respect
to a time selection function $I$ is defined by

$$R_A^I(T) = \max_{i \in [N]} \left( \sum_{t=1}^{T} I(t) \ell_t(x_t) - \sum_{t=1}^{T} I(t) \ell_t(i) \right),$$

where $\ell_t(i)$ is the loss of expert $i$ at time $t$. This setting can be
viewed as a generalization of the sleeping expert setting [?]. For a fixed set
$\mathcal{I}$ consisting of $M$ time selection functions, they proved a regret
bound of $O\left(\sqrt{L_{\min,I} \log(NM)} + \log(NM)\right)$ with² respect to
each time selection function $I \in \mathcal{I}$. We observe that if we let
$\mathcal{I}$ be the set of all indicator functions of intervals (note that
$|\mathcal{I}| = \binom{T}{2} = O(T^2)$), we obtain a strongly adaptive algorithm
for learning with expert advice. However, the (multiplicative) computational
overhead of our algorithm (w.r.t. the standard MW algorithm) at time $t$ is
$O(\log(t))$, whereas the computational overhead of their algorithm is $O(T^2)$.
Furthermore, our setting is much more general than the expert setting.

Another related, but somewhat orthogonal line of work [?] studies
drifting environments. The focus of those papers is on scenarios where the
environment is changing slowly over time.


2 Reducing Adaptive Regret to Standard Regret

In this section we present our strongly adaptive meta-algorithm, named Strongly
Adaptive Online Learner (SAOL). For the rest of this section we fix a learning
scenario $(\mathcal{D}, \mathcal{C}, \mathcal{L})$ and an algorithm $B$ that
operates in this scenario (think of $B$ as a low regret algorithm).

We first give a high level description of SAOL. The basic idea is to run an
instance of $B$ on each interval $I$ from an appropriately chosen set of
intervals, denoted $\mathcal{I}$. The instance corresponding to $I$ is denoted
$B_I$, and can be thought of as an expert that gives its advice for the best
action at each time slot in $I$. The algorithm weights the various $B_I$'s
according to their performance in the past, in a way that instances with better
performance get more weight. The exact weighting is a variant of the
multiplicative weights rule. At each step, SAOL picks at random one of the
$B_I$'s and follows its advice. The probability of choosing each $B_I$ is
proportional to its weight. Next, we give more details.

² where $L_{\min,I} = \min_i \sum_{t=1}^{T} I(t) \ell_t(i)$.

The choice of $\mathcal{I}$. As in the MW algorithm, the weighting procedure is
used to ensure that SAOL performs optimally for every $I \in \mathcal{I}$.
Therefore, the choice of $\mathcal{I}$ exhibits the following tradeoff. On one
hand, $\mathcal{I}$ should be large, since we want optimal performance on
intervals in $\mathcal{I}$ to result in optimal performance on every interval.
On the other hand, we would like to keep $\mathcal{I}$ small, since running many
instances of $B$ in parallel will result in a large computational cost. To
balance these desires, we let

$$\mathcal{I} = \bigcup_{k \in \mathbb{N} \cup \{0\}} \mathcal{I}_k ,$$

where for all $k \in \mathbb{N} \cup \{0\}$,

$$\mathcal{I}_k = \left\{ [i \cdot 2^k, (i+1) \cdot 2^k - 1] : i \in \mathbb{N} \right\} .$$

That is, each $\mathcal{I}_k$ is a partition of
$\mathbb{N} \setminus \{1, \ldots, 2^k - 1\}$ into consecutive intervals of length
$2^k$. We denote by

$$\mathrm{ACTIVE}(t) := \{ I \in \mathcal{I} : t \in I \}$$

the set of active intervals at time $t$. By the definition of $\mathcal{I}_k$,
for every $t < 2^k$ no interval in $\mathcal{I}_k$ contains $t$, while for every
$t \ge 2^k$ a single interval in $\mathcal{I}_k$ contains $t$. Therefore,

$$|\mathrm{ACTIVE}(t)| = \lfloor \log(t) \rfloor + 1 .$$
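The count of active intervals can be checked directly by enumerating, for each scale $k$ with $2^k \le t$, the unique interval of $\mathcal{I}_k$ that contains $t$. The function name `active_intervals` below is an illustrative assumption, and the logarithm is taken to be base 2.

```python
def active_intervals(t):
    """All intervals [i*2^k, (i+1)*2^k - 1] with i >= 1, k >= 0 that contain t >= 1."""
    out = []
    k = 0
    while 2 ** k <= t:
        i = t // 2 ** k                      # index of the block of length 2^k holding t
        out.append((i * 2 ** k, (i + 1) * 2 ** k - 1))
        k += 1
    return out


example = active_intervals(6)   # [(6, 6), (6, 7), (4, 7)]
```

For $t = 6$ there are three active intervals, matching $\lfloor \log_2 6 \rfloor + 1 = 3$.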

It follows that the running time of SAOL at time $t$ is at most $(\log(t) + 1)$
times larger than the running time of $B$. On the other hand, as we show in the
proof, we can cover every interval by intervals from $\mathcal{I}$, in a way that
will guarantee small regret on the covered interval, provided that we have small
regret on the covering intervals.

The weighting method. Let $x_t(I)$ be the action taken by $B_I$ at time $t$. The
instantaneous regret of SAOL w.r.t. $B_I$ at time $t$ is
$r_t(I) = \ell_t(x_t) - \ell_t(x_t(I))$. As explained above, SAOL maintains
weights over the $B_I$'s. For $I = [q, s]$, the weight of $B_I$ at time $t$ is
denoted $w_t(I)$. For $t < q$, $B_I$ is not active yet, so we let $w_t(I) = 0$.
At the “entry” time, $t = q$, we set $w_t(I) = \eta_I$ where

$$\eta_I := \min\left\{ 1/2, \; 1/\sqrt{|I|} \right\} .$$

The weight at time $t \in (q, s]$ is the previous weight times
$(1 + \eta_I \cdot r_{t-1}(I))$. Overall, we have

$$
w_t(I) =
\begin{cases}
0 & t \notin I \\
\eta_I & t = q \\
w_{t-1}(I) \, (1 + \eta_I \cdot r_{t-1}(I)) & t \in (q, s]
\end{cases}
\qquad (2)
$$

Note that the instantaneous regret always lies in $[-1, 1]$, and
$\eta_I \in (0, 1)$, therefore weights are always positive during the lifetime of
the corresponding expert. Also, the weight of $B_I$ decreases (increases) if its
loss is higher (lower) than the predicted loss.

The overall weight at time $t$ is defined by

$$W_t := \sum_{I \in \mathcal{I}} w_t(I) = \sum_{I \in \mathrm{ACTIVE}(t)} w_t(I) .$$

Finally, a probability distribution over the experts at time $t$ is defined by

$$p_t(I) = \frac{w_t(I)}{W_t} .$$

Note that the probability mass assigned to any inactive instance is zero. The
probability distribution $p_t$ determines the action of SAOL at time $t$. Namely,
we have $x_t = x_t(I)$ with probability $p_t(I)$. A pseudo-code of SAOL is
detailed in Algorithm 1.


Algorithm 1 Strongly Adaptive Online Learner (with blackbox algorithm B)

    Initialize: $w_1(I) = 1/2$ if $I = [1, 1]$, and $w_1(I) = 0$ otherwise
    for t = 1 to T do
        Let $W_t = \sum_{I \in \mathrm{ACTIVE}(t)} w_t(I)$
        Choose $I \in \mathrm{ACTIVE}(t)$ w.p. $p_t(I) = w_t(I) / W_t$
        Predict $x_t(I)$
        Update weights according to Equation (2)
    end for
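The pseudo-code above can be sketched in Python for the expert-advice scenario as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the black box is a toy follow-the-leader learner (`FollowTheLeader`, chosen only because it is short and deterministic), and the names `new_intervals`, `saol`, `make_learner` and the `predict`/`update` interface are hypothetical. Instances are created lazily when their interval begins and discarded when it ends, so only the active ones are kept in memory.

```python
import math
import random


class FollowTheLeader:
    """Toy black box B over n experts (an assumed stand-in for a low-regret algorithm)."""

    def __init__(self, n):
        self.total = [0.0] * n

    def predict(self):
        return min(range(len(self.total)), key=lambda a: self.total[a])

    def update(self, loss):
        self.total = [c + l for c, l in zip(self.total, loss)]


def new_intervals(t):
    """Intervals [i*2^k, (i+1)*2^k - 1] of the covering set whose left endpoint is t."""
    out, k = [], 0
    while t % 2 ** k == 0:
        out.append((t, t + 2 ** k - 1))
        k += 1
    return out


def saol(losses, make_learner, seed=0):
    rng = random.Random(seed)
    active = {}                      # (q, s) -> [weight w_t(I), instance B_I]
    played, counts = [], []
    for t, loss in enumerate(losses, start=1):
        for q, s in new_intervals(t):
            eta = min(0.5, 1.0 / math.sqrt(s - q + 1))
            active[(q, s)] = [eta, make_learner()]   # entry weight eta_I
        keys = list(active)
        counts.append(len(keys))
        advice = {I: active[I][1].predict() for I in keys}
        # Sample an active instance with probability proportional to its weight.
        chosen = rng.choices(keys, weights=[active[I][0] for I in keys])[0]
        x_t = advice[chosen]
        played.append(x_t)
        for q, s in keys:
            weight, learner = active[(q, s)]
            if t == s:               # the interval is over, drop the instance
                del active[(q, s)]
                continue
            learner.update(loss)
            eta = min(0.5, 1.0 / math.sqrt(s - q + 1))
            r = loss[x_t] - loss[advice[(q, s)]]     # instantaneous regret r_t(I)
            active[(q, s)][0] = weight * (1.0 + eta * r)
    return played, counts


# Hypothetical run on random losses for two experts.
env = random.Random(1)
losses = [[env.random(), env.random()] for _ in range(100)]
played, counts = saol(losses, lambda: FollowTheLeader(2))
```

The list `counts` records how many instances were active in each round, which is the per-round overhead factor relative to the black box.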


2.1 Proof Sketch of Theorem 1

In this section we sketch the proof of Theorem 1. A full proof is detailed in
Appendix A. The analysis of SAOL is divided into two parts. The first challenge
is to prove the theorem for the intervals in $\mathcal{I}$ (see Lemma 2). Then,
the theorem should be extended to any interval (end of Appendix A).

Let us start with the first task. Our first observation is that for every
interval $I$, the regret of SAOL during the interval $I$ is equal to

(SAOL's regret relative to $B_I$ + the regret of $B_I$)    (3)

(during the interval $I$). Since the regret of $B_I$ during the interval $I$ is
already guaranteed to be small (Equation (1)), the problem of ensuring low regret
during each of the intervals in $\mathcal{I}$ is reduced to the problem of
ensuring low regret with respect to each of the $B_I$'s.

We next prove that the regret of SAOL with respect to the $B_I$'s is small.
Our analysis is similar to the proof of [2] [Theorem 16]. Both of these proofs
are similar to the analysis of the Multiplicative Weights Update (MW) method.
The main idea is to define a potential function and relate it both to the loss of
the learner and the loss of the best expert.

To this end, we start by defining pseudo-weights over the experts (the
$B_I$'s). With a slight abuse of notation, we define
$I(t) = \mathbb{1}_{[t \in I]}$. For any $I = [q, s] \in \mathcal{I}$, the
pseudo-weight of $B_I$ is defined by:

$$
\tilde{w}_t(I) =
\begin{cases}
0 & t < q \\
1 & t = q \\
\tilde{w}_{t-1}(I) \cdot (1 + \eta_I \cdot r_{t-1}(I)) & q < t \le s + 1 \\
\tilde{w}_{s+1}(I) & t > s + 1
\end{cases}
$$

Note that

$$w_t(I) = \eta_I \cdot I(t) \cdot \tilde{w}_t(I) .$$

The potential function we consider is the overall pseudo-weight at time $t$,
$\tilde{W}_t = \sum_{I \in \mathcal{I}} \tilde{w}_t(I)$. The following lemma,
whose proof is given in the appendix, is a useful consequence of our definitions.

Lemma 1 For every $t \ge 1$,

$$\tilde{W}_t \le t (\log(t) + 1) .$$

Through straightforward calculations, we conclude the proof of Theorem 1 for
any interval in $\mathcal{I}$.

Lemma 2 For every $I = [q, s] \in \mathcal{I}$,

$$\sum_{t=q}^{s} r_t(I) \le 5 \log(s+1) \sqrt{|I|} .$$

Hence, according to Equation (3),

$$R_{\mathrm{SAOL}^B}(I) \le C \cdot |I|^{\alpha} + 5 \log(s+1) \sqrt{|I|} .$$

The proof is given in the appendix.

The extension of the theorem to any interval relies on some useful properties
of the set $\mathcal{I}$ (see Lemma 4 in the appendix). Roughly speaking, any
interval $I \subseteq [T]$ can be partitioned into two sequences of intervals from
$\mathcal{I}$, such that the lengths of the intervals in each sequence decay at an
exponential rate (Lemma 5 in the appendix). The theorem now follows by bounding
the regret during the interval $I$ by the sum of the regrets during the intervals
in the above two sequences, and by using the fact that the lengths decay
exponentially.
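The summation step can be sketched as follows. This is an illustration of how constants of the form appearing in Theorem 1 can arise, not a transcription of the full proof: it assumes that Lemma 2 is applied to each covering interval, that each of the two sequences has lengths at most $|I|, |I|/2, |I|/4, \ldots$, and that every covering interval ends no later than $s$.

```latex
\begin{align*}
\sum_{i \ge 0} C \left( 2^{-i} |I| \right)^{\alpha}
  &= \frac{C |I|^{\alpha}}{1 - 2^{-\alpha}}
   = \frac{2^{\alpha}}{2^{\alpha} - 1} \, C |I|^{\alpha}, \\
\sum_{i \ge 0} 5 \log(s+1) \sqrt{2^{-i} |I|}
  &= \frac{5}{1 - 2^{-1/2}} \log(s+1) \sqrt{|I|}
   \le 17.1 \, \log(s+1) \sqrt{|I|} .
\end{align*}
% Two sequences double both sums. Since 2^alpha <= 2 for alpha in (0,1):
\begin{align*}
R_{\mathrm{SAOL}^B}(I)
  \le \frac{2 \cdot 2^{\alpha}}{2^{\alpha} - 1} \, C |I|^{\alpha} + 34.2 \, \log(s+1) \sqrt{|I|}
  \le \frac{4}{2^{\alpha} - 1} \, C |I|^{\alpha} + 40 \log(s+1) \sqrt{|I|} .
\end{align*}
```

The first inequality of the last display uses the fact that the regret on $I$ is at most the sum of the regrets on the intervals of a partition of $I$.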

3 Strongly Adaptive Regret Is Stronger Than Tracking Regret

In this section we relate the notion of strong adaptivity to that of tracking
regret, and show that algorithms with small strongly adaptive regret also have
small tracking regret. Let us briefly review the problem of tracking. For
simplicity, we focus on context-less learning problems, and on the case where the
set of strategies coincides with the decision space (though the result can be
straightforwardly generalized). Fix a decision space $\mathcal{D}$ and a family
$\mathcal{L}$ of loss functions. A compound action is a sequence
$\sigma = (\sigma_1, \ldots, \sigma_T) \in \mathcal{D}^T$. Since there is no hope
in competing w.r.t. all sequences³, a typical restriction of the problem is to
bound the number of switches in each sequence. For a positive integer $m$, the
class of compound actions with at most $m$ switches is defined by

$$B_m = \left\{ \sigma \in \mathcal{D}^T : s(\sigma) := \sum_{t=1}^{T-1} \mathbb{1}_{[\sigma_{t+1} \ne \sigma_t]} \le m \right\} . \qquad (4)$$

The notions of loss and regret naturally extend to this setting. For example, the
cumulative loss of a compound action $\sigma \in B_m$ is defined by
$L_\sigma(T) = \sum_{t=1}^{T} \ell_t(\sigma_t)$. The tracking regret of an
algorithm $A$ w.r.t. the class $B_m$ is defined by

$$\text{Tracking-Regret}_A^m(T) = L_A(T) - \inf_{\sigma \in B_m} L_\sigma(T) .$$

The following theorem bounds the tracking regret of algorithms with bounds on
the strongly adaptive regret, and in particular of SAOL.

Theorem 2 Let $A$ be a learning algorithm with
$\text{SA-Regret}_A(\tau) \le C \tau^{\alpha}$. Then,
$\text{Tracking-Regret}_A^m(T) \le C T^{\alpha} m^{1-\alpha}$.

Proof Let $\sigma \in B_m$. Let $I_1, \ldots, I_m$ be the intervals that
correspond to $\sigma$. Clearly, the tracking regret w.r.t. $\sigma$ is bounded by
the sum of the regrets of $A$ during the intervals $I_1, \ldots, I_m$. Hence,
using Hölder's inequality, we have

$$
L_A(T) - L_\sigma(T)
\le \sum_{i=1}^{m} R_A(I_i)
\le C \sum_{i=1}^{m} |I_i|^{\alpha}
\le C m^{1-\alpha} T^{\alpha} .
$$


Recall that for the problem of prediction with expert advice, the strongly
adaptive regret of SAOL (with, say, Multiplicative Weights as a black box) is
$O\left( (\sqrt{\ln(N)} + \log(T)) \sqrt{\tau} \right)$. Hence, we obtain a
tracking bound of $O\left( (\sqrt{\ln(N)} + \log(T)) \sqrt{mT} \right)$.
Up to a $\sqrt{\log(T)}$ factor, this bound is asymptotically equivalent to the
bound of⁴ the Fixed-Share Algorithm of [9]. Also, up to a $\log(T)$ factor, the
bound is optimal. One advantage of SAOL over Fixed-Share is that SAOL is
parameter-free. In particular, SAOL does not need to know⁵ $m$.

³ It is easy to prove a lower bound of order $T$ for this problem.
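The two ingredients of the proof of Theorem 2, namely splitting a compound action into the intervals on which it is constant and the Hölder step $\sum_i |I_i|^{\alpha} \le n^{1-\alpha} T^{\alpha}$ for $n$ intervals of total length $T$, can be checked numerically. The helper names `switches` and `segments` and the example sequence are assumptions made for illustration. Note that this sketch counts the constant segments directly, so a sequence with three switches yields four segments.

```python
def switches(sigma):
    """Number of positions t with sigma[t+1] != sigma[t]."""
    return sum(1 for a, b in zip(sigma, sigma[1:]) if a != b)


def segments(sigma):
    """Maximal constant runs of sigma as 1-indexed inclusive intervals."""
    out, start = [], 1
    for t in range(1, len(sigma)):
        if sigma[t] != sigma[t - 1]:
            out.append((start, t))
            start = t + 1
    out.append((start, len(sigma)))
    return out


sigma = [0, 0, 0, 1, 1, 0, 0, 0, 0, 1]      # hypothetical compound action, T = 10
runs = segments(sigma)
lengths = [b - a + 1 for a, b in runs]


def holder_gap(alpha):
    """n^(1-alpha) * T^alpha minus the sum of |I_i|^alpha (non-negative by Hölder)."""
    n, T = len(lengths), sum(lengths)
    return n ** (1 - alpha) * T ** alpha - sum(l ** alpha for l in lengths)
```

The gap is zero exactly when all segments have the same length, which is the worst case for the tracking bound.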


4 Strongly Adaptive Regret in The Bandit Setting

In this section we consider the challenge of achieving adaptivity in the bandit
setting. Following our notation, in the bandit setting, only the loss incurred
by the learner, $\ell_t(x_t)$, is revealed at the end of each round (rather than
the loss function, $\ell_t$). For many online learning problems for which there
exists an efficient low-regret algorithm in the full information model, a simple
reduction from the bandit setting to the full information setting (for example,
see [12] [Theorem 4.1]) yields an efficient low-regret bandit algorithm.
Furthermore, it is often the case that the dependence of the regret on $T$ is not
affected by the lack of information. For example, for the Multi-armed bandit
(MAB) problem [1] (which is the bandit version of the problem of prediction with
expert advice), the above reduction yields an algorithm with near optimal regret
bound of $2\sqrt{T N \log N}$.

A natural question is whether adaptivity can be achieved with bandit
feedback. A few positive results are known. For example, applying the
aforementioned reduction to the Fixed-Share algorithm results in an efficient
bandit learner whose tracking regret is
$O\left( \sqrt{T m (\ln(N) + \ln(T)) N} \right)$.

The next theorem shows that with bandit feedback there are no algorithms
with non-trivial bounds on the strongly adaptive regret. We focus on the MAB
problem with two arms (experts) but it is easy to generalize the result to any
nondegenerate online problem. Recall that for this problem we do not have a
context, $\mathcal{W} = \mathcal{D} = \{e_1, e_2\}$ and
$\mathcal{L} = [0,1]^{\mathcal{D}}$.

Theorem 3 For all $\epsilon > 0$, there is no algorithm for MAB with strongly
adaptive regret of $O(\tau^{1-\epsilon} \, \mathrm{poly}(\log T))$.

The idea of the proof is simple. Suppose toward a contradiction that $A$ is an
algorithm with strongly adaptive regret of
$O(\tau^{1-\epsilon} \, \mathrm{poly}(\log T))$. This means that the regret of $A$
on every interval $I$ of length $T^{\epsilon/2}$ is non-trivial (i.e. $o(|I|)$).
Intuitively, this means that both arms must be inspected at least once during $I$.
Suppose now that one of the arms is always superior to the second (say, has loss
zero while the other has loss one). By the above argument, the algorithm will
still inspect the bad arm at least once in every $T^{\epsilon/2}$ time slots.
Those inspections will result in a regret of
$T / T^{\epsilon/2} = T^{1-\epsilon/2}$. This, however, is a contradiction, since
the strongly adaptive regret bound implies that the standard regret of $A$ is
$o(T^{1-\epsilon/2})$.

⁴ For the comparison, we rely on a simplified form of the bound of the
Fixed-Share algorithm. This simplified form can be found, for example, in
http://web.eecs.umich.edu/~jabernet/eecs598course/web/notes/lec5_091813.pdf

⁵ The parameters of Fixed-Share do depend on $m$.

This idea is formalized in the following lemma. It implies Theorem 3, as for $A$
with strongly adaptive regret of $O(\tau^{1-\epsilon} \, \mathrm{poly}(\log T))$
we can take $k = O(T^{1-\epsilon/2})$ and reach a contradiction, as the lemma
implies that on some segment $I$ of size $\Omega(T/k) = \Omega(T^{\epsilon/2})$,
the regret of $A$ is $\Omega(T^{\epsilon/2})$, which grows faster than
$|I|^{1-\epsilon} \, \mathrm{poly}(\log T)$.
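The exponent bookkeeping behind this contradiction can be spelled out as follows. This is a sketch under the assumption that the standard regret bound is written as $k = c \, T^{1-\epsilon/2}$ for some constant $c > 0$ (the symbol $c$ is introduced here for illustration), with the interval size $T/(4k)$ taken from the proof of Lemma 3.

```latex
\begin{align*}
|I| &= \frac{T}{4k} = \frac{T^{\epsilon/2}}{4c},
\qquad
R_A(I) = \Omega(|I|) = \Omega\!\left( T^{\epsilon/2} \right), \\
|I|^{1-\epsilon} \, \mathrm{poly}(\log T)
  &= O\!\left( T^{\frac{\epsilon}{2}(1-\epsilon)} \, \mathrm{poly}(\log T) \right)
   = O\!\left( T^{\frac{\epsilon}{2}} \cdot T^{-\frac{\epsilon^2}{2}} \, \mathrm{poly}(\log T) \right)
   = o\!\left( T^{\epsilon/2} \right).
\end{align*}
```

The polynomial factor $T^{-\epsilon^2/2}$ dominates any polylogarithmic term, so the lower bound on $R_A(I)$ eventually exceeds the assumed strongly adaptive upper bound.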

Lemma 3 Let $A$ be an algorithm with regret bounded by

$$R_A(T) \le k = k(T) .$$

Then, there exists an interval $I \subseteq [T]$ of size $\Omega(T/k)$ with

$$R_A(I) = \Omega(|I|) .$$

Proof Assume for simplicity that $4k$ divides $T$. Consider the environment
$E^0$, in which $\forall t$, $\ell_t(e_1) = 0.5$, $\ell_t(e_2) = 1$. Let
$U \subseteq [T]$ be the (possibly random) set of time slots in which the
algorithm chooses $e_2$ when the environment is $E^0$. Since the regret is at most
$k$, we have $\mathbb{E}[|U|] \le 2k$. It follows that for some segment
$I \subseteq [T]$ of size $\frac{T}{4k}$ we have
$\mathbb{E}[|U \cap I|] \le \frac{1}{2}$. Indeed, otherwise, if
$[T] = I_1 \cup \ldots \cup I_{4k}$ is the partition of the interval $[T]$ into
$4k$ disjoint and consecutive intervals of size $\frac{T}{4k}$, we will have
$\mathbb{E}[|U|] = \sum_{j=1}^{4k} \mathbb{E}[|U \cap I_j|] > 2k$.

Now, since $|U \cap I|$ is a non-negative integer, w.p. $\ge \frac{1}{2}$ we have
$|U \cap I| = 0$. Namely, w.p. $\ge \frac{1}{2}$, $A$ does not inspect $e_2$
during the interval $I$ when it runs against $E^0$. Consider now the environment
$E$ that is identical to $E^0$, besides that $\forall t \in I$,
$\ell_t(e_2) = 0$. By the argument above, w.p. $\ge \frac{1}{2}$ the operation of
$A$ on $E$ is identical to its operation on $E^0$. In particular, the regret on
$I$ when $A$ plays against $E$ is, w.p. $\ge \frac{1}{2}$, at least
$\frac{1}{2} |I|$, and in total at least
$\frac{1}{2} \cdot \frac{1}{2} \cdot |I|$. ■

A Proof of Theorem 1

A.1 Proving Theorem 1 for Any Interval in $\mathcal{I}$

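Proof (of Lemma 1) The proof is by induction on $t$. For $t = 1$, we have

$$\tilde{W}_1 = \tilde{w}_1([1,1]) = 1 .$$

Next, we assume that the claim holds for any $t' \le t$ and prove it for $t + 1$.
Since $|\{[q, s] \in \mathcal{I} : q = t\}| \le \lfloor \log(t) \rfloor + 1$ for
all $t \ge 1$, we have

$$
\begin{aligned}
\tilde{W}_{t+1} &= \sum_{I = [q,s] \in \mathcal{I}} \tilde{w}_{t+1}(I) \\
&= \sum_{I = [t+1,s] \in \mathcal{I}} \tilde{w}_{t+1}(I)
 + \sum_{I = [q,s] \in \mathcal{I} :\, q \le t} \tilde{w}_{t+1}(I) \\
&\le \log(t+1) + 1 + \sum_{I = [q,s] \in \mathcal{I} :\, q \le t} \tilde{w}_{t+1}(I) .
\end{aligned}
$$

Next, according to the induction hypothesis, we have

$$
\begin{aligned}
\sum_{I = [q,s] \in \mathcal{I} :\, q \le t} \tilde{w}_{t+1}(I)
&= \sum_{I = [q,s] \in \mathcal{I} :\, q \le t} \tilde{w}_t(I) \left( 1 + \eta_I \cdot I(t) \cdot r_t(I) \right) \\
&= \tilde{W}_t + \sum_{I \in \mathcal{I}} \eta_I \cdot I(t) \cdot r_t(I) \cdot \tilde{w}_t(I) \\
&\le t (\log(t) + 1) + \sum_{I \in \mathcal{I}} w_t(I) \cdot r_t(I) .
\end{aligned}
$$

Hence,

$$
\begin{aligned}
\tilde{W}_{t+1} &\le t (\log(t) + 1) + \log(t+1) + 1 + \sum_{I \in \mathcal{I}} w_t(I) \cdot r_t(I) \\
&\le (t+1)(\log(t+1) + 1) + \sum_{I \in \mathcal{I}} w_t(I) \cdot r_t(I) .
\end{aligned}
$$

We complete the proof by showing that
$\sum_{I \in \mathcal{I}} w_t(I) \cdot r_t(I) = 0$. Since $x_t = x_t(I)$ with
probability $p_t(I)$ for every $I \in \mathcal{I}$, we obtain

$$
\sum_{I \in \mathcal{I}} w_t(I) \cdot r_t(I)
= W_t \sum_{I \in \mathcal{I}} p_t(I) \left( \ell_t(x_t) - \ell_t(x_t(I)) \right)
= 0 .
$$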

Combining the above inequalities, we conclude the lemma. ■

Proof (of Lemma 2) Fix some $I = [q, s] \in \mathcal{I}$. We need to show that

$$\sum_{t=q}^{s} r_t(I) \le 5 \log(s+1) \sqrt{|I|} .$$

Since weights are non-negative, using Lemma 1 we obtain

$$\tilde{w}_{s+1}(I) \le \tilde{W}_{s+1} \le (s+1)(\log(s+1) + 1) .$$

Hence,

$$\ln(\tilde{w}_{s+1}(I)) \le \ln(s+1) + \ln(\log(s+1) + 1) . \qquad (5)$$

Next, we note that

$$\tilde{w}_{s+1}(I) = \prod_{t=q}^{s} (1 + \eta_I \cdot I(t) \cdot r_t(I)) = \prod_{t=q}^{s} (1 + \eta_I \cdot r_t(I)) .$$

Noting that $\eta_I \in (0, 1/2]$ and using the inequality
$\ln(1 + x) \ge x - x^2$, which holds for every $x \ge -1/2$, we obtain

$$
\begin{aligned}
\ln(\tilde{w}_{s+1}(I)) &= \sum_{t=q}^{s} \ln(1 + \eta_I \cdot r_t(I)) \\
&\ge \sum_{t=q}^{s} \eta_I \cdot r_t(I) - \sum_{t=q}^{s} (\eta_I \cdot r_t(I))^2 \\
&\ge \eta_I \left( \sum_{t=q}^{s} r_t(I) - \eta_I |I| \right) . \qquad (6)
\end{aligned}
$$

Combining Equation (6) and Equation (5) and dividing by $\eta_I$, we obtain

$$
\begin{aligned}
\sum_{t=q}^{s} r_t(I) &\le \eta_I |I| + \eta_I^{-1} \left( \ln(s+1) + \ln(\log(s+1) + 1) \right) \\
&\le \eta_I |I| + \eta_I^{-1} \left( \log(s+1) + \log(s+1) \right) \\
&= \eta_I |I| + 2 \eta_I^{-1} \log(s+1) ,
\end{aligned}
$$

where the second inequality follows from the inequality $x \ge \ln(1 + x)$.
Substituting $\eta_I := \min\left\{ 1/2, \, 1/\sqrt{|I|} \right\}$, we conclude
the lemma.
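The final substitution can be carried out explicitly. The following is a sketch of the two cases, under the assumption that $\log$ denotes the base-2 logarithm, so that $\log(s+1) \ge 1$ for every $s \ge 1$; recall that $|I|$ is a power of two.

```latex
\begin{align*}
\text{Case } |I| \ge 4 \ (\eta_I = |I|^{-1/2}): \quad
\eta_I |I| + 2 \eta_I^{-1} \log(s+1)
  &= \sqrt{|I|} + 2 \sqrt{|I|} \log(s+1)
   \le 3 \log(s+1) \sqrt{|I|} . \\
\text{Case } |I| \in \{1, 2\} \ (\eta_I = 1/2): \quad
\eta_I |I| + 2 \eta_I^{-1} \log(s+1)
  &= \tfrac{1}{2} |I| + 4 \log(s+1)
   \le \sqrt{|I|} \log(s+1) + 4 \sqrt{|I|} \log(s+1) \\
  &= 5 \log(s+1) \sqrt{|I|} .
\end{align*}
```

In the second case the estimate $\frac{1}{2}|I| \le \sqrt{|I|}$ holds because $\sqrt{|I|} \le 2$, and $4 \le 4\sqrt{|I|}$ holds because $|I| \ge 1$.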


A.2 Extending The Theorem to Any Interval

In the next part we complete the proof of Theorem 1 by extending Lemma 2 to
every interval.

Before proceeding, we set up some additional notation and also make some
simple but useful observations regarding the properties of the set $\mathcal{I}$
(defined in Section 2).

For an interval $J \subseteq \mathbb{N}$, we denote the restriction of
$\mathcal{I}$ to $J$ by $\mathcal{I}|_J$. That is,
$\mathcal{I}|_J = \{ I \in \mathcal{I} : I \subseteq J \}$. We next list some
useful properties of the set $\mathcal{I}$ that follow immediately from its
definition (thus, we do not prove these claims).

Lemma 4

1. The size of every interval $I \in \mathcal{I}$ is $2^j$ for some
$j \in \mathbb{N} \cup \{0\}$.

2. For every $j \in \mathbb{N} \cup \{0\}$, the left endpoint of the leftmost
interval $I$ whose size is $2^j$ is $2^j$. Thus, the size of every interval which
is located to the left of $I$ is smaller than $|I| = 2^j$.

3. Let $I = [q, s] \in \mathcal{I}$ be an interval and let $I' = [q', q-1]$ be
another interval of size $2^j |I|$ for some $j \le 0$. Then,
$I' \in \mathcal{I}$.

4. Let $I = [q, s] \in \mathcal{I}$ be an interval and let $I' = [s+1, s']$ be a
consecutive interval of size $2^j |I|$ for some $j \le 0$. Then,
$I' \in \mathcal{I}$.

5. Let $I = [q, s] \in \mathcal{I}$ be an interval of size $2^j$ for some
$j \in \mathbb{N} \cup \{0\}$. Then, (exactly) one of the intervals
$[q, q + 2^{j+1} - 1]$, $[s+1, s + 2^{j+1}]$ (whose size is $2^{j+1}$) belongs to
$\mathcal{I}$.

The following lemma is a key tool for extending Lemma 2 to any interval.

Lemma 5 Let $I = [q, s] \subseteq \mathbb{N}$ be an arbitrary interval. Then, the
interval $I$ can be partitioned into two finite sequences of disjoint and
consecutive intervals, denoted $(I_{-k}, \ldots, I_0) \subseteq \mathcal{I}|_I$
and $(I_1, I_2, \ldots, I_p) \subseteq \mathcal{I}|_I$, such that

$$(\forall i \ge 1) \quad |I_{-i}| / |I_{-i+1}| \le 1/2 .$$

$$(\forall i \ge 2) \quad |I_i| / |I_{i-1}| \le 1/2 .$$

The lemma is illustrated in the accompanying figure. We next prove the lemma.
Whenever we mention Property 1, ..., 5, we refer to Property 1, ..., 5 of Lemma 4.
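The partition promised by Lemma 5 can be computed greedily: pick the leftmost interval of maximal size inside $[q, s]$, then repeatedly take the largest interval of the family that ends (respectively starts) at the current boundary. The sketch below follows that construction; the function names `in_family` and `cover` are illustrative assumptions, and the code is meant as a small-scale check rather than an efficient implementation.

```python
def in_family(a, b):
    """True if [a, b] equals [i*2^k, (i+1)*2^k - 1] for some i >= 1, k >= 0."""
    size = b - a + 1
    return size >= 1 and size & (size - 1) == 0 and a % size == 0 and a >= size


def cover(q, s):
    """Split [q, s] (1 <= q <= s) into the two sequences of Lemma 5."""
    # I_0: leftmost interval of maximal size contained in [q, s].
    for k in range(s.bit_length(), -1, -1):
        a = -(-q // 2 ** k) * 2 ** k          # smallest multiple of 2^k that is >= q
        if a + 2 ** k - 1 <= s:
            break
    center = (a, a + 2 ** k - 1)
    left, end = [], a - 1
    while end >= q:                            # largest family interval ending at `end`
        j = max(j for j in range(end.bit_length() + 1)
                if end - 2 ** j + 1 >= q and in_family(end - 2 ** j + 1, end))
        left.append((end - 2 ** j + 1, end))
        end -= 2 ** j
    right, start = [], center[1] + 1
    while start <= s:                          # largest family interval starting at `start`
        j = max(j for j in range(s.bit_length() + 1)
                if start + 2 ** j - 1 <= s and in_family(start, start + 2 ** j - 1))
        right.append((start, start + 2 ** j - 1))
        start += 2 ** j
    return left[::-1] + [center], right        # (I_{-k}, ..., I_0) and (I_1, ..., I_p)


left_part, right_part = cover(3, 10)
```

For the interval $[3, 10]$ this yields $([3,3], [4,7])$ on the left and $([8,9], [10,10])$ on the right, with lengths halving away from $I_0 = [4,7]$.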

Proof Let $b_0 = \max\{|I'| : I' \in \mathcal{I}|_I\}$ be the maximal size of any interval $I' \in \mathcal{I}$
that is contained in $I$. Among all of these intervals, let $I_0$ be the leftmost
interval, i.e., we define

$$q_0 := \operatorname{argmin}\{q' : [q', q' + b_0 - 1] \in \mathcal{I}|_I\} ,$$
$$s_0 = q_0 + b_0 - 1 ,$$
$$I_0 = [q_0, s_0] .$$

Starting from $q_0 - 1$, we define a sequence of disjoint and consecutive intervals
(in a reversed order), denoted $(I_{-1}, \ldots, I_{-k})$, as follows:

$$[q_{-1}, s_{-1}] := I_{-1} := \operatorname{argmax}_{I' = [q', s'] \in \mathcal{I}|_{[q, q_0 - 1]} \,:\, s' = q_0 - 1} |I'| ,$$

$$[q_{-i}, s_{-i}] := I_{-i} := \operatorname{argmax}_{I' = [q', s'] \in \mathcal{I}|_{[q, q_{-i+1} - 1]} \,:\, s' = q_{-i+1} - 1} |I'| .$$


Clearly, this sequence is finite and the left endpoint of the leftmost interval,
$I_{-k}$, is $q$. Denote the size of $I_{-i}$ by $b_{-i}$. We next prove that for every $i \ge 1$,
$b_{-i}/b_{-i+1} = 2^j$ for some $j \le -1$. We note that according to Property 1 it
suffices to show that $b_{-i} < b_{-i+1}$ for every $i \ge 1$. We use induction. The
base case $b_{-1} < b_0$ follows from the minimality of $q_0$: $b_0$ is the maximal size,
and $I_0$ is the leftmost interval of that size. We next assume that the claim
holds for every $i \in \{1, \ldots, k - 1\}$ and prove it for $k$. Assume by contradiction that
$b_{-k} \ge b_{-k+1}$. Consider the interval $\hat{I}_{-k+1}$ which is obtained by concatenating a
copy of $I_{-k+1}$ to its left$^6$. It follows that $\hat{I}_{-k+1}$ is an interval of size $2b_{-k+1}$ which
is contained in $[q, q_{-k+2} - 1]$ and its right endpoint is $q_{-k+2} - 1$. According
to the induction hypothesis, $|\hat{I}_{-k+1}| = 2b_{-k+1} = 2^j \cdot b_{-k+2}$ for some $j \le 0$.
It follows from Property 3 that $\hat{I}_{-k+1} \in \mathcal{I}|_I$, contradicting the maximality of
$I_{-k+1}$.

Similarly, starting from $s_0 + 1$, we define a sequence of disjoint and consecutive
intervals, denoted $(I_1, \ldots, I_p)$:

$$[q_1, s_1] := I_1 := \operatorname{argmax}_{I' = [q', s'] \in \mathcal{I}|_{[s_0 + 1, s]} \,:\, q' = s_0 + 1} |I'| ,$$

$$[q_i, s_i] := I_i := \operatorname{argmax}_{I' = [q', s'] \in \mathcal{I}|_{[s_{i-1} + 1, s]} \,:\, q' = s_{i-1} + 1} |I'| .$$


Clearly, this sequence is finite and the right endpoint of the rightmost interval,
$I_p$, is $s$. Denote the size of $I_i$ by $b_i$. We next prove that for every $i \ge 2$,
$b_i/b_{i-1} = 2^j$ for some $j \le -1$. According to Property 1 it suffices to prove
that $b_i < b_{i-1}$ for every $i \ge 2$. For this purpose, we first note that $b_1 \le b_0$;
this follows immediately from the definition of $b_0$. Hence, we may assume that
$b_i/b_{i-1} \in \{2^j : j \le 0\}$ for every $i \in \{1, \ldots, p - 1\}$ and prove that $b_p < b_{p-1}$.
Assume by contradiction that $b_p \ge b_{p-1}$. Consider the interval $\hat{I}_{p-1}$ which is
obtained by concatenating a copy of $I_{p-1}$ to its right. It follows that $\hat{I}_{p-1}$ is an
interval of size $2b_{p-1}$ which is contained in $[s_{p-2} + 1, s]$ and its left endpoint is
$s_{p-2} + 1$. According to the induction hypothesis, $|\hat{I}_{p-1}| = 2b_{p-1} = 2^j \cdot b_{p-2}$ for
some $j \le 1$. We need to consider the following two cases:

• Assume first that $j \le 0$ (thus, $b_{p-1}/b_{p-2} \le 1/2$). Then, it follows from
Property 4 that $\hat{I}_{p-1} \in \mathcal{I}|_I$, contradicting the maximality of $I_{p-1}$.

• Assume that $j = 1$ (i.e., $b_{p-1} = b_{p-2}$). Then, using Property 5 we obtain
a contradiction to the maximality of $I_{p-2}$.

6 Formally, $\hat{I}_{-k+1} := [q_{-k+1} - b_{-k+1}, q_{-k+1} - 1] \cup I_{-k+1}$.
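
The construction used in the proof is a greedy procedure, and it may help to see it run. The Python sketch below follows the same steps under stated assumptions: $\mathcal{I}$ is taken to be the set of intervals $[i \cdot 2^k, (i+1) \cdot 2^k - 1]$ with $k \ge 0$ and $i \ge 1$, and the helper names (`in_cover`, `geometric_partition`, `sizes_halve`) are hypothetical, introduced only for this illustration.

```python
def in_cover(q: int, s: int) -> bool:
    """True if [q, s] = [i * 2**k, (i + 1) * 2**k - 1] for some k >= 0, i >= 1."""
    size = s - q + 1
    return q >= 1 and size >= 1 and (size & (size - 1)) == 0 and q % size == 0


def geometric_partition(q: int, s: int):
    """Split [q, s] (1 <= q <= s) into ([I_{-k}, ..., I_0], [I_1, ..., I_p])."""
    # b0 is the largest size of a cover interval inside [q, s];
    # I_0 is the leftmost cover interval of that size.
    b0 = 1
    while True:
        nxt = 2 * b0
        first = -(-q // nxt) * nxt  # smallest multiple of nxt that is >= q
        if first + nxt - 1 > s:
            break
        b0 = nxt
    q0 = -(-q // b0) * b0
    left = [(q0, q0 + b0 - 1)]

    # Walk left from q0 - 1, always taking the largest cover interval that
    # ends at the current position and stays inside [q, s].
    end = q0 - 1
    while end >= q:
        size = 1
        while end - 2 * size + 1 >= q and in_cover(end - 2 * size + 1, end):
            size *= 2
        left.append((end - size + 1, end))
        end -= size
    left.reverse()

    # Walk right from s0 + 1 in the same greedy fashion.
    right = []
    start = q0 + b0
    while start <= s:
        size = 1
        while start + 2 * size - 1 <= s and in_cover(start, start + 2 * size - 1):
            size *= 2
        right.append((start, start + size - 1))
        start += size
    return left, right


def sizes_halve(left, right) -> bool:
    """Check the two ratio conditions of Lemma 5."""
    length = lambda iv: iv[1] - iv[0] + 1
    left_ok = all(2 * length(a) <= length(b) for a, b in zip(left, left[1:]))
    right_ok = all(2 * length(b) <= length(a) for a, b in zip(right, right[1:]))
    return left_ok and right_ok


left, right = geometric_partition(1, 30)
```

For $I = [1, 30]$ the sketch returns the left sequence $[1], [2,3], [4,7], [8,15]$ and the right sequence $[16,23], [24,27], [28,29], [30]$, which is the partition shown in Figure 1. Doubling the candidate size until it fails is enough to find the maximal interval, because a valid interval of size $2b$ anchored at a given endpoint implies a valid one of size $b$ at the same endpoint.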


We are now ready to complete the proof of Theorem 1.

Proof (of Theorem 1) Consider an arbitrary interval $I = [q, s] \subseteq [T]$, and
let $I = \bigcup_{i=-k}^{p} I_i$ be the partition described in Lemma 5. Then,

$$R_{\mathrm{SAOL}^B}(I) \le \sum_{i \le 0} R_{\mathrm{SAOL}^B}(I_i) + \sum_{i \ge 1} R_{\mathrm{SAOL}^B}(I_i) . \qquad (7)$$


We next bound the first term in the right-hand side of Equation (7). According
to Lemma 2 we obtain that

$$\sum_{i \le 0} R_{\mathrm{SAOL}^B}(I_i) \le C \sum_{i \le 0} |I_i|^\alpha + 5 \sum_{i \le 0} \log(s_i + 1) |I_i|^{1/2} \le C \sum_{i \le 0} |I_i|^\alpha + 5 \log(s + 1) \sum_{i \le 0} |I_i|^{1/2} .$$

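According to Lemma 5,

$$\sum_{i \le 0} |I_i|^\alpha \le \sum_{i=0}^{\infty} \left(2^{-i} |I|\right)^\alpha = \frac{2^\alpha}{2^\alpha - 1} |I|^\alpha \le \frac{2}{2^\alpha - 1} |I|^\alpha .$$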

Figure 1: Geometric Covering of Interval: The interval $I = [1, 30]$ is partitioned
into the sequences $(I_{-3} = [1], I_{-2} = [2, 3], I_{-1} = [4, 7], I_0 = [8, 15])$ and $(I_1 =
[16, 23], I_2 = [24, 27], I_3 = [28, 29], I_4 = [30])$.


Similarly, we have

$$\sum_{i \le 0} |I_i|^{1/2} \le \frac{\sqrt{2}}{\sqrt{2} - 1} |I|^{1/2} \le 4 |I|^{1/2} .$$

Combining the three last inequalities, we obtain that

$$\sum_{i \le 0} R_{\mathrm{SAOL}^B}(I_i) \le \frac{2}{2^\alpha - 1} C |I|^\alpha + 20 \log(s + 1) |I|^{1/2} .$$

The second term of the right-hand side of Equation (7) is bounded identically.
Hence,

$$R_{\mathrm{SAOL}^B}(I) \le \frac{4}{2^\alpha - 1} C |I|^\alpha + 40 \log(s + 1) |I|^{1/2} .$$
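
The constants in the final bound come from two geometric series. The following Python sketch checks them numerically on the left sequence of Figure 1 (sizes 8, 4, 2, 1 inside $I = [1, 30]$). It assumes $0 < \alpha \le 1$, which is what the step $2^\alpha \le 2$ requires; the function names are hypothetical and serve only as an illustration.

```python
import math


def geometric_constant(alpha: float) -> float:
    """Closed form of sum_{i >= 0} 2**(-i * alpha), valid for alpha > 0."""
    return 2 ** alpha / (2 ** alpha - 1)


def left_sum_within_bound(sizes, total, alpha) -> bool:
    """Check sum_i |I_i|**alpha <= 2 / (2**alpha - 1) * |I|**alpha."""
    lhs = sum(b ** alpha for b in sizes)
    rhs = 2 / (2 ** alpha - 1) * total ** alpha
    return lhs <= rhs


sizes = [8, 4, 2, 1]  # |I_0|, |I_{-1}|, |I_{-2}|, |I_{-3}| for I = [1, 30]
total = 30            # |I|
checks = [left_sum_within_bound(sizes, total, a) for a in (0.25, 0.5, 0.75, 1.0)]

# For alpha = 1/2 the constant is sqrt(2) / (sqrt(2) - 1), which is below 4,
# so the logarithmic term picks up the factor 5 * 4 = 20 on each side.
sqrt_constant = geometric_constant(0.5)
```

Since each of the two sequences contributes the same bound, the factors $\frac{2}{2^\alpha - 1}$ and $20$ double to the $\frac{4}{2^\alpha - 1}$ and $40$ of the final inequality.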